17 Conditional Expectation

#ConditionalExpectation #TowerProperty #WaldIdentity #MSE #MAE #Median #ConditionalVariance

$E [Y] = E [E [Y | X]] = E [\frac{X}{2}] = \frac{1}{2} E [X] = \frac{l}{4} .$

1 Conditional Expectation

Assume that $f_{X, Y} (x, y)$ is well-defined for all $(x, y) \in R^{2}$ .

Conditional expectation

If $Y$ is continuous, we define the conditional expectation $E [Y | X = x] = \int_{- \infty}^{+ \infty} y f_{Y | X = x} (y) d y .$
More generally, for a suitable function $g : R \to R$ , $E [g (Y) | X = x] = \int_{- \infty}^{+ \infty} g (y) f_{Y | X = x} (y) d y .$
And for a suitable function $h : R^{2} \to R$ , $E [h (X, Y) | X = x] = E [h (x, Y) | X = x] .$

The conditional density is $f_{Y | X = x} (y) = \frac{f_{X, Y} (x, y)}{f_{X} (x)}$ , defined for $x$ with $f_{X} (x) > 0$ .

If $Y$ is discrete, $E [Y | X = x] = \sum_{y} y P (Y = y | X = x) .$

Example

Choose one of two dice and throw it. Die 0 is fair, and die 1 is loaded. $X$ is the die chosen and $Y$ is the number shown. Define $P (Y = 6 | X = 1) = p, P (Y = j | X = 1) = \frac{1 - p}{5}, 1 \leq j \leq 5.$ Then $\begin{aligned} E [Y | X = 0] = & \sum_{y = 1}^{6} \frac{y}{6} = \frac{7}{2}, \\ E [Y | X = 1] = & 6 p + \sum_{y = 1}^{5} y (\frac{1 - p}{5}) = 3 (1 + p) . \end{aligned}$
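The two conditional expectations in this example can be checked with exact arithmetic. The following is a minimal sketch; the helper names `conditional_pmf` and `conditional_expectation` and the value $p = \frac{1}{2}$ are illustrative assumptions, not part of the notes.

```python
from fractions import Fraction


def conditional_pmf(die, p):
    """Return P(Y = y | X = die) as a dict for y = 1..6."""
    if die == 0:  # fair die
        return {y: Fraction(1, 6) for y in range(1, 7)}
    # Loaded die: six has probability p, the other faces share 1 - p equally.
    pmf = {y: (1 - p) / 5 for y in range(1, 6)}
    pmf[6] = p
    return pmf


def conditional_expectation(die, p):
    """E[Y | X = die] = sum over y of y * P(Y = y | X = die)."""
    return sum(y * prob for y, prob in conditional_pmf(die, p).items())


p = Fraction(1, 2)  # illustrative value, not from the notes
print(conditional_expectation(0, p))  # fair die: 7/2
print(conditional_expectation(1, p))  # loaded die: equals 3 * (1 + p)
```

Using `Fraction` keeps the comparison with the closed form $3 (1 + p)$ exact rather than approximate.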

For a fixed $x$ , $E [Y | X = x]$ satisfies the usual properties of expectation, e.g., linearity.

$E [Y | X = x] = ψ (x)$ is a function of $x$ .

$E [Y | X] = ψ (X)$ is a function of the random variable $X$ , so it is itself a random variable: $ψ (X) : Ω \to R, [ψ (X)] (ω) = ψ (X (ω))$ . It can therefore also be seen as the composition $ψ \circ X$ .

Theorem (Law of total expectation/Law of iterated expectation/Tower property)

For any random variable $Y$ , s.t. $E [| Y |] < \infty$ , $E [Y] = E [E [Y | X]] .$

By convention, each expectation averages over whatever is still random inside it. So the inner layer $E [Y | X]$ is an expectation over $Y$ (given $X$), and the outer layer is over $X$ .

Proof

$\begin{aligned} E [E [Y | X]] = & \int E [Y | X = x] f_{X} (x) d x \\ = & \int [\int y f_{Y | X = x} (y) d y] f_{X} (x) d x \\ = & \int y [\int f_{Y | X = x} (y) f_{X} (x) d x] d y \\ = & \int y f_{Y} (y) d y = E [Y] . \end{aligned}$

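The third equality swaps the order of integration, which is justified because $E [| Y |] < \infty$ . The second-to-last equality uses the law of total probability, $\int f_{Y | X = x} (y) f_{X} (x) d x = f_{Y} (y)$ .

For the discrete case, replace $\int$ with $\sum$ .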
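As a sanity check of the discrete version, the sketch below computes both sides of $E [Y] = E [E [Y | X]]$ for the two-dice example. The notes do not say how the die is chosen, so $q = P (X = 1) = \frac{1}{3}$ is an assumption made only for illustration, as is $p = \frac{1}{2}$.

```python
from fractions import Fraction

p = Fraction(1, 2)  # P(Y = 6 | X = 1); illustrative value
q = Fraction(1, 3)  # P(X = 1); assumption, the notes do not specify it

p_x = {0: 1 - q, 1: q}
p_y_given_x = {
    0: {y: Fraction(1, 6) for y in range(1, 7)},
    1: {**{y: (1 - p) / 5 for y in range(1, 6)}, 6: p},
}

# Left side: E[Y] from the marginal pmf of Y (law of total probability).
p_y = {y: sum(p_x[x] * p_y_given_x[x][y] for x in p_x) for y in range(1, 7)}
lhs = sum(y * prob for y, prob in p_y.items())

# Right side: E[E[Y | X]] = sum_x psi(x) P(X = x), with psi(x) = E[Y | X = x].
psi = {x: sum(y * prob for y, prob in p_y_given_x[x].items()) for x in p_x}
rhs = sum(psi[x] * p_x[x] for x in p_x)

print(lhs, rhs)  # both sides agree
```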

Theorem (Wald's Identity)

Suppose $X_{1}, X_{2}, \dots$ is a sequence of i.i.d. random variables with $E [X_{i}] = μ < \infty$ , and $N$ is a positive-integer-valued random variable, s.t. $N ⊥ ⊥ X_{1}, X_{2}, \dots$ , and $E [N] < \infty$ .
Let $S_{N} = X_{1} + \dots + X_{N}$ . Then $E [S_{N}] = μ E [N] .$

Proof

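Conditional on $N = n$ , independence of $N$ and the $X_{i}$ gives $E [S_{N} | N = n] = E [X_{1} + \dots + X_{n}] = n μ$ . Hence $E [S_{N}] = E [E [S_{N} | N]] = E [N E [X_{1}]] = E [N μ] = μ E [N] .$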
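Wald's identity can be verified by brute force on a small case. In the sketch below the $X_{i}$ are fair die rolls and $N$ takes values in $\{1, 2, 3\}$; both choices, including the pmf `p_n`, are assumptions picked only to keep the enumeration small.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # X_i: fair die, so mu = 7/2
# Assumed pmf of N, independent of the X_i.
p_n = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}

# E[S_N] by enumerating every outcome (n, x_1, ..., x_n).
# Independence lets us weight each outcome by P(N = n) * (1/6)^n.
expected_sum = Fraction(0)
for n, prob_n in p_n.items():
    for rolls in product(faces, repeat=n):
        expected_sum += prob_n * Fraction(1, 6) ** n * sum(rolls)

mu = Fraction(sum(faces), 6)
expected_n = sum(n * prob for n, prob in p_n.items())
print(expected_sum, mu * expected_n)  # the two values coincide
```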

2 Important Applications

2.1 Statistical risk minimization

$Y$ is a random variable of interest (we want to predict). $g (X)$ is a prediction of $Y$ . Loss function is $L (Y, g (X))$ . Risk is $R (g) = E [L (Y, g (X))]$ . It's the expectation over both $X$ and $Y$ .
The goal is to find $g^{*} = \underset{g}{argmin} R (g)$ .

Example: Mean Squared Error ( #MSE )

$L (Y, g (X)) = (Y - g (X))^{2}$ . Then $\begin{aligned} R (g) = & E [(Y - g (X))^{2}] = E [E [(Y - g (X))^{2} | X]] . \end{aligned}$
Consider $h (c) = E [(Z - c)^{2}]$ , where $Z$ is random variable and $c$ is constant. Then $\begin{aligned} h (c) & = E [Z^{2}] - 2 c E [Z] + c^{2}, \\ h^{'} (c) & = - 2 E [Z] + 2 c, \\ h^{″} (c) & = 2. \end{aligned}$
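Since $h^{″} (c) > 0$ , $h$ is convex and its stationary point is the minimizer: $c^{*} = \underset{c}{argmin} h (c) = E [Z]$ . Applying this conditionally on $X = x$ , with $Z$ distributed as $Y$ given $X = x$ and $c = g (x)$ , minimizes the inner expectation for every $x$ . Then $g^{*} (X) = E [Y | X]$ .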
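The claim $c^{*} = E [Z]$ can be checked numerically. The sketch below uses a hypothetical three-point distribution for $Z$ (not from the notes), evaluates $h (c) = E [(Z - c)^{2}]$ on a grid, and also confirms the identity $h (c) = Var (Z) + (c - E [Z])^{2}$ , which follows from expanding the square.

```python
from fractions import Fraction

# Hypothetical pmf of Z, chosen only for illustration.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 4: Fraction(1, 2)}


def h(c):
    """h(c) = E[(Z - c)^2]."""
    return sum(prob * (z - c) ** 2 for z, prob in pmf.items())


mean = sum(z * prob for z, prob in pmf.items())
variance = h(mean)  # h evaluated at the mean is Var(Z)

# Search a grid c in {0, 1/8, ..., 4}; the grid contains E[Z] = 9/4.
grid = [Fraction(k, 8) for k in range(0, 33)]
best_c = min(grid, key=h)
print(mean, best_c, variance)
```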

For Mean Absolute Error ( #MAE ), see below.

Median

For a random variable $X$ , a median of the distribution of $X$ is any value $m$ , s.t. $P (X \leq m) \geq \frac{1}{2}, P (X \geq m) \geq \frac{1}{2} .$

Every distribution has at least one median.

Median may not be unique.

Example

| $x$ | $1$ | $2$ | $3$ | $4$ |
| --- | --- | --- | --- | --- |
| $P (X = x)$ | $\frac{2}{10}$ | $\frac{3}{10}$ | $\frac{1}{10}$ | $\frac{4}{10}$ |

Then any $m \in [2, 3]$ is a median.
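The two inequalities in the definition can be tested directly for this distribution. This is a small sketch; the function name `is_median` and the candidate values are illustrative choices.

```python
from fractions import Fraction

# pmf of X from the example.
pmf = {
    1: Fraction(2, 10),
    2: Fraction(3, 10),
    3: Fraction(1, 10),
    4: Fraction(4, 10),
}


def is_median(m):
    """Check P(X <= m) >= 1/2 and P(X >= m) >= 1/2."""
    left = sum(prob for x, prob in pmf.items() if x <= m)
    right = sum(prob for x, prob in pmf.items() if x >= m)
    return left >= Fraction(1, 2) and right >= Fraction(1, 2)


for m in [Fraction(3, 2), 2, Fraction(5, 2), 3, Fraction(7, 2)]:
    print(m, is_median(m))
```

Candidates inside $[2, 3]$ pass both checks, while $\frac{3}{2}$ fails the first inequality and $\frac{7}{2}$ fails the second.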

Theorem (MAE Minimizer)

Let $Z$ be a random variable with a finite median $m$ . Then, $m$ minimizes $h (c) = E [| Z - c |]$ .

Back to risk minimization, $L (Y, g (X)) = | Y - g (X) |, R (g) = E [| Y - g (X) |] .$
Any function $g$ s.t. $g (x)$ is a conditional median of $Y$ given $X = x$ minimizes $R (g)$ .
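To illustrate the MAE minimizer theorem, the sketch below evaluates $h (c) = E [| Z - c |]$ on a grid for the four-point distribution from the median example. The grid spacing of $\frac{1}{4}$ is an arbitrary choice; the minimizers found on it are the grid points in $[2, 3]$ , which is the set of medians.

```python
from fractions import Fraction

# pmf from the median example.
pmf = {
    1: Fraction(2, 10),
    2: Fraction(3, 10),
    3: Fraction(1, 10),
    4: Fraction(4, 10),
}


def h(c):
    """h(c) = E[|Z - c|]."""
    return sum(prob * abs(z - c) for z, prob in pmf.items())


grid = [Fraction(k, 4) for k in range(0, 21)]  # c in {0, 1/4, ..., 5}
h_min = min(h(c) for c in grid)
minimizers = [c for c in grid if h(c) == h_min]
print(h_min, minimizers)
```

Because $h$ is flat on the whole interval of medians, the non-uniqueness of the median carries over to the MAE-optimal predictor.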

3 Conditional Variance

We know $Var (Y) = E [Y^{2}] - (E [Y])^{2} = E [(Y - E [Y])^{2}] .$
And conditional variance is defined as $\begin{aligned} Var (Y | X = x) = & E [(Y - E [Y | X = x])^{2} | X = x] \\ = & E [Y^{2} | X = x] - (E [Y | X = x])^{2} . \end{aligned}$

Claim (Law of Total Variance)

$Var (Y) = E_{X} [Var (Y | X)] + Var (E_{Y | X} [Y | X]) .$

Proof

$\begin{aligned} Var (Y) = & E [Y^{2}] - (E [Y])^{2} \\ = & E [E [Y^{2} | X]] - (E [E [Y | X]])^{2} \\ = & E [E [Y^{2} | X]] - E [(E [Y | X])^{2}] + E [(E [Y | X])^{2}] - (E [E [Y | X]])^{2} \\ = & E [E [Y^{2} | X] - (E [Y | X])^{2}] + Var (E [Y | X]) \\ = & E [Var (Y | X)] + Var (E [Y | X]) . \end{aligned}$
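The decomposition can be checked exactly on the two-dice example. As before, the notes do not fix how the die is chosen, so $q = P (X = 1) = \frac{1}{3}$ and $p = \frac{1}{2}$ are assumptions for illustration, and the helper `mean` is a hypothetical name.

```python
from fractions import Fraction

p = Fraction(1, 2)  # P(Y = 6 | X = 1); illustrative value
q = Fraction(1, 3)  # P(X = 1); assumption, the notes do not specify it

p_x = {0: 1 - q, 1: q}
p_y_given_x = {
    0: {y: Fraction(1, 6) for y in range(1, 7)},
    1: {**{y: (1 - p) / 5 for y in range(1, 6)}, 6: p},
}


def mean(pmf, f=lambda v: v):
    """E[f(V)] for a discrete pmf given as {value: probability}."""
    return sum(f(v) * prob for v, prob in pmf.items())


# Conditional mean and variance of Y for each value of X.
cond_mean = {x: mean(p_y_given_x[x]) for x in p_x}
cond_var = {
    x: mean(p_y_given_x[x], lambda y: y ** 2) - cond_mean[x] ** 2 for x in p_x
}

# E[Var(Y | X)] and Var(E[Y | X]), both taken over the distribution of X.
expected_cond_var = sum(cond_var[x] * p_x[x] for x in p_x)
overall_mean = sum(cond_mean[x] * p_x[x] for x in p_x)
var_of_cond_mean = sum((cond_mean[x] - overall_mean) ** 2 * p_x[x] for x in p_x)

# Var(Y) computed directly from the marginal pmf of Y.
p_y = {y: sum(p_x[x] * p_y_given_x[x][y] for x in p_x) for y in range(1, 7)}
var_y = mean(p_y, lambda y: y ** 2) - mean(p_y) ** 2

print(var_y, expected_cond_var + var_of_cond_mean)
```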

Application

Suppose $X_{1}, X_{2}, \dots$ is a sequence of iid RVs with $E [X_{i}] = μ < \infty$ and $Var (X_{i}) = σ^{2} < \infty$ . $N$ is a positive-integer-valued RV, s.t. $N ⊥ ⊥ X_{1}, X_{2}, \dots$ , $E [N] < \infty$ and $Var (N) < \infty .$ Let $S_{N} = X_{1} + \dots + X_{N}$ .
Then $Var (S_{N}) = E [\underbrace{Var (S_{N} | N)}_{σ^{2} N}] + Var (\underbrace{E [S_{N} | N]}_{μ N}) = σ^{2} E [N] + μ^{2} Var (N) .$
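The variance formula for a random sum can be confirmed by enumeration on a small case. In this sketch the $X_{i}$ are fair die rolls and $N$ has an assumed pmf on $\{1, 2, 3\}$ ; both are illustrative choices, not part of the notes.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # X_i: fair die
# Assumed pmf of N, independent of the X_i.
p_n = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}

# First and second moments of S_N by enumerating every (n, x_1, ..., x_n).
first = Fraction(0)
second = Fraction(0)
for n, prob_n in p_n.items():
    for rolls in product(faces, repeat=n):
        weight = prob_n * Fraction(1, 6) ** n
        s = sum(rolls)
        first += weight * s
        second += weight * s * s
var_s = second - first ** 2

# Right-hand side: sigma^2 E[N] + mu^2 Var(N).
mu = Fraction(sum(faces), 6)
sigma2 = Fraction(sum(x * x for x in faces), 6) - mu ** 2
e_n = sum(n * prob for n, prob in p_n.items())
var_n = sum(n * n * prob for n, prob in p_n.items()) - e_n ** 2
formula = sigma2 * e_n + mu ** 2 * var_n

print(var_s, formula)  # the two values coincide
```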